Abstract: Whereas deep neural networks were first mostly 
used for classification tasks, they are rapidly expanding in the 
realm of structured output problems, where the observed target 
is composed of multiple random variables that have a rich joint 
distribution, given the input. We focus in this paper on the case 
where the input also has a rich structure and the input and output 
structures are somehow related. We describe systems that learn 
to attend to different places in the input, for each element of the 
output, for a variety of tasks: machine translation, image caption 
generation, video clip description and speech recognition. All 
these systems are based on a shared set of building blocks: gated 
recurrent neural networks and convolutional neural networks, 
along with trained attention mechanisms. We report on experimental
results with these systems, showing impressively good
performance and the advantage of the attention mechanism. 

I. Introduction 

In this paper we focus on the application of deep learning
to structured output problems where the task is to map the 
input to an output that possesses its own structure. The task is 
therefore not only to map the input to the correct output (e.g. 
the classification task in object recognition), but also to model 
the structure within the output sequence. 

A classic example of a structured output problem is machine
translation: to automatically translate a sentence from
the source language to the target language. To accomplish 
this task, not only does the system need to be concerned 
with capturing the semantic content of the source language 
sentence, but also with forming a coherent and grammatical 
sentence in the target language. In other words, given an input 
source sentence, we cannot choose the elements of the output 
(i.e. the individual words) independently: they have a complex 
joint distribution. 

Structured output problems represent a large and important 
class of problems that include classic tasks such as speech 
recognition and many natural language processing problems 
(e.g. text summarization and paraphrase generation). As the 
range of capabilities of deep learning systems increases, less 
established forms of structured output problems, such as image 
caption generation and video description generation ([1] and
references therein), are being considered.

One important aspect of virtually all structured output tasks 
is that the structure of the output is intimately related to the
structure of the input. A central challenge in these tasks is
therefore the problem of alignment. At its most fundamental,
the problem of alignment is the problem of how to relate
sub-elements of the input to sub-elements of the output. Consider
again our example of machine translation. In order to translate
the source sentence into the target language we need to first
decompose the source sentence into its constituent semantic 
parts. Then we need to map these semantic parts to their 
counterparts in the target language. Finally, we need to use 
these semantic parts to compose the sentence following the 
grammatical regularities of the target language. Each word or 
phrase of the target sentence can be aligned to a word or phrase 
in the source language. 

In the case of image caption generation, it is often appropriate
for the output sentence to accurately describe the spatial
relationships between elements of the scene represented in the 
image. For this, we need to align the output words to spatial 
regions of the source image. 

In this paper we focus on a general approach to the 
alignment problem known as the soft attention mechanism. 
Broadly, attention mechanisms are components of prediction 
systems that allow the system to sequentially focus on different 
subsets of the input. The selection of the subset is typically 
conditioned on the state of the system which is itself a function 
of the previously attended subsets. 

Attention mechanisms are employed for two purposes. The 
first is to reduce the computational burden of processing high 
dimensional inputs by selecting to only process subsets of the 
input. The second is to allow the system to focus on distinct 
aspects of the input and thus improve its ability to extract the 
most relevant information for each piece of the output, thus 
yielding improvements in the quality of the generated outputs. 

As the name suggests, soft attention mechanisms avoid
a hard selection of which subsets of the input to attend to
and instead use a soft weighting of the different subsets.
Since all subsets are processed, these mechanisms offer no
computational advantage. Instead, the advantage brought by
the soft-weighting is that it is readily amenable to efficient 
learning via gradient backpropagation. 

In this paper, we present a review of the recent work
in applying soft attention to structured output tasks and
speculate about the future course of this line of research. The
soft-attention mechanism is part of a growing literature on
more flexible deep learning architectures that embed a certain 
amount of distributed decision making. 

II. Background: Recurrent and Convolutional Neural Networks

A. Recurrent Neural Network

A recurrent neural network (RNN) is a neural network
specialized at handling a variable-length input sequence
$x = (x_1, \ldots, x_T)$ and optionally a corresponding variable-length
output sequence $y = (y_1, \ldots, y_T)$, using an internal hidden
state $h$. The RNN sequentially reads each symbol $x_t$ of
the input sequence and updates its internal hidden state $h_t$
according to

$$h_t = \phi_\theta(h_{t-1}, x_t), \tag{1}$$

where $\phi_\theta$ is a nonlinear activation function parametrized by
a set of parameters $\theta$. When the target sequence is given, the
RNN can be trained to sequentially make a prediction $\hat{y}_t$ of
the actual output $y_t$ at each time step $t$:

$$\hat{y}_t = g_\theta(h_t, x_t), \tag{2}$$

where $g_\theta$ may be an arbitrary, parametric function that is
learned jointly as a part of the whole network.

The recurrent activation function $\phi$ in Eq. (1) may be as
simple as an affine transformation followed by an element-wise
nonlinearity (here the hyperbolic tangent) such that

$$h_t = \tanh(U h_{t-1} + W x_t),$$

where $U$ and $W$ are the learned weight matrices.¹

It has recently become more common to use more sophisticated
recurrent activation functions, such as a long short-term
memory (LSTM, [2]) or a gated recurrent unit (GRU, [3], [4]),
to reduce the issue of vanishing gradient [5], [6]. Both LSTM
and GRU avoid the vanishing gradient by introducing gating 
units that adaptively control the flow of information across 
time steps. 

The activation of a GRU, for instance, is defined by

$$h_t = u_t \odot \tilde{h}_t + (1 - u_t) \odot h_{t-1},$$

where $\odot$ is an element-wise multiplication, and the update
gates $u_t$ are

$$u_t = \sigma(U_u h_{t-1} + W_u x_t).$$

The candidate hidden state $\tilde{h}_t$ is computed by

$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1})),$$

where the reset gates $r_t$ are computed by

$$r_t = \sigma(U_r h_{t-1} + W_r x_t).$$

All the use cases of the RNN in the remainder of this paper
use either the GRU or LSTM.
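
As an illustration of the GRU update described above, the following is a minimal NumPy sketch of a single step. It assumes the standard GRU form in which the reset gate scales the previous hidden state, omits biases as in the text, and uses hypothetical parameter names (`U_u`, `W_u`, `U_r`, `W_r`, `U`, `W`) and toy dimensions chosen only for the example.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x, params):
    """One GRU update with biases omitted (hypothetical parameter dict)."""
    u = sigmoid(params["U_u"] @ h_prev + params["W_u"] @ x)  # update gate
    r = sigmoid(params["U_r"] @ h_prev + params["W_r"] @ x)  # reset gate
    # Candidate state: the reset gate scales the previous hidden state
    # (standard GRU form, an assumption of this sketch).
    h_cand = np.tanh(params["W"] @ x + params["U"] @ (r * h_prev))
    # Interpolate between the candidate and the previous state.
    return u * h_cand + (1.0 - u) * h_prev

def init_params(n_hidden, n_input, rng):
    """Small random weights; shapes follow the equations in the text."""
    shapes = {
        "U_u": (n_hidden, n_hidden), "W_u": (n_hidden, n_input),
        "U_r": (n_hidden, n_hidden), "W_r": (n_hidden, n_input),
        "U": (n_hidden, n_hidden), "W": (n_hidden, n_input),
    }
    return {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
params = init_params(n_hidden=4, n_input=3, rng=rng)
h = np.zeros(4)
for x in rng.standard_normal((5, 3)):  # a toy input sequence of length 5
    h = gru_step(h, x, params)
```

Because the new state is a convex combination of a tanh output and the previous state, every component of `h` stays strictly inside (-1, 1) when the initial state is zero.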

B. RNN-LM: Recurrent Neural Network Language Modeling 

In the task of language modeling, we let a model learn 
the probability distribution over natural language sentences. In 
other words, given a model, we can compute the probability of 
a sentence $s = (w_1, w_2, \ldots, w_T)$ consisting of multiple words,
i.e., $p(w_1, w_2, \ldots, w_T)$, where the sentence is $T$ words long.

This task of language modeling is equivalent to the task
of predicting the next word. This is clear by rewriting the
sentence probability into

$$p(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_{<t}), \tag{3}$$

where $w_{<t} = (w_1, \ldots, w_{t-1})$. Each conditional probability
on the right-hand side corresponds to the predictive probability
of the next word $w_t$ given all the preceding words
$(w_1, \ldots, w_{t-1})$.

¹ We omit biases to make the equations less cluttered.

A recurrent neural network (RNN) can thus be readily used
for language modeling by letting it predict the next symbol at
each time step $t$ (RNN-LM, [7]). In other words, the RNN
predicts the probability over the next word by

$$p(w_{t+1} = w \mid w_{\leq t}) = g_\theta^w(h_t, w_t), \tag{4}$$

where $g_\theta^w$ returns the probability of the word $w$ out of all
possible words. The internal hidden state $h_t$ summarizes all
the preceding symbols $w_{\leq t} = (w_1, \ldots, w_t)$.

We can generate an exact sentence sample from an RNN-LM
by iteratively sampling from the next word distribution
$p(w_{t+1} \mid w_{\leq t})$ in Eq. (4). Instead of stochastic sampling, it is
possible to approximately find a sentence sample that maximizes
the probability $p(s)$ using, for instance, beam search [8],
[9].
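
The iterative sampling procedure above can be sketched as follows. This is a toy illustration only: it assumes a simple tanh recurrence rather than a GRU or LSTM, a softmax output layer, and hypothetical parameter names (`E`, `U`, `W`, `V`); the end-of-sentence symbol is also used as the start symbol.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def sample_sentence(E, U, W, V, eos, max_len, rng):
    """Draw one sentence by ancestral sampling from a toy RNN-LM.

    E: word embeddings (vocab, n_emb); U, W: recurrent and input weights;
    V: output weights (vocab, n_hidden). All are hypothetical toy parameters.
    """
    h = np.zeros(U.shape[0])
    word = eos  # the end-of-sentence symbol doubles as the start symbol
    sentence, log_prob = [], 0.0
    for _ in range(max_len):
        h = np.tanh(U @ h + W @ E[word])  # update the hidden state
        p = softmax(V @ h)                # next-word distribution
        word = int(rng.choice(len(p), p=p))
        log_prob += np.log(p[word])       # accumulate log p(w_t | w_<t)
        sentence.append(word)
        if word == eos:
            break
    return sentence, log_prob

rng = np.random.default_rng(0)
vocab, n_emb, n_hidden = 6, 3, 4
E = rng.standard_normal((vocab, n_emb))
U = 0.5 * rng.standard_normal((n_hidden, n_hidden))
W = rng.standard_normal((n_hidden, n_emb))
V = rng.standard_normal((vocab, n_hidden))
sentence, log_prob = sample_sentence(E, U, W, V, eos=0, max_len=20, rng=rng)
```

The returned `log_prob` is the sum of the conditional log-probabilities of the sampled words, i.e. the logarithm of the product in Eq. (3) for the sampled sentence.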

The RNN-LM described here can be extended to learn a 
conditional language model. In conditional language modeling,
the task is to model the distribution over sentences
given an additional input, or context. The context may be
anything from an image and a video clip to a sentence in
another language. Examples of textual outputs associated with
these inputs by the conditional RNN-LM include respectively
an image caption, a video description and a translation. In
these cases, the transition function of the RNN will take as an
additional input the context $c$ such that

$$h_t = \phi_\theta(h_{t-1}, x_t, c). \tag{5}$$

Note the $c$ at the end of the r.h.s. of the equation.

This conditional language model based on RNNs will be at 
the center of later sections. 

C. Deep Convolutional Network 

A convolutional neural network (CNN) is a special type 
of a more general feedforward neural network, or multilayer 
perceptron, that has been specifically designed to work well 
with two-dimensional images [10]. The CNN often consists
of multiple convolutional layers followed by a few fully-connected
layers.

At each convolutional layer, the input image of width $n_i$,
height $n_j$ and $c$ color channels ($x \in \mathbb{R}^{n_i \times n_j \times c}$) is first
convolved with a set of local filters $\mathbf{f} \in \mathbb{R}^{n'_i \times n'_j \times c \times d}$. For
each location/pixel $(i, j)$ of $x$, we get

$$z_{i,j} = \sum_{i'=1}^{n'_i} \sum_{j'=1}^{n'_j} f\left(\mathbf{f}_{i',j'}^{\top} x_{i+i',j+j'}\right), \tag{6}$$

where $\mathbf{f}_{i',j'} \in \mathbb{R}^{c \times d}$, $x_{i+i',j+j'} \in \mathbb{R}^{c}$ and $z_{i,j} \in \mathbb{R}^{d}$. $f$ is an
element-wise nonlinear activation function.

The convolution in Eq. (6) is followed by local max-pooling:

$$h_{i,j} = \max_{\substack{i' \in \{ri, \ldots, (r+1)i - 1\}, \\ j' \in \{rj, \ldots, (r+1)j - 1\}}} z_{i',j'}, \tag{7}$$

for all $i \in \{1, \ldots, n_i/r\}$ and $j \in \{1, \ldots, n_j/r\}$. $r$ is the size
of the neighborhood.

The pooling operation has two desirable properties. First,
it reduces the dimensionality of a high-dimensional output
of the convolutional layer. Second, spatial max-pooling
summarizes the neighbouring feature activations, leading
to (local) translation invariance.
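
The convolution and pooling steps above can be sketched in NumPy as follows. This is a rough illustration under stated assumptions: it uses a "valid" convolution without padding, a ReLU as the element-wise nonlinearity applied after summing over the filter window (the common convention), and non-overlapping pooling windows; the function names and sizes are hypothetical.

```python
import numpy as np

def conv_layer(x, filters):
    """Valid convolution (cross-correlation) followed by a ReLU.

    x: image of shape (n_i, n_j, c); filters: shape (k_i, k_j, c, d).
    Returns a feature map of shape (n_i - k_i + 1, n_j - k_j + 1, d).
    """
    k_i, k_j, _, d = filters.shape
    out_i, out_j = x.shape[0] - k_i + 1, x.shape[1] - k_j + 1
    z = np.zeros((out_i, out_j, d))
    for i in range(out_i):
        for j in range(out_j):
            patch = x[i:i + k_i, j:j + k_j, :]
            # Sum over the window and the input channels for every filter.
            z[i, j] = np.tensordot(patch, filters, axes=3)
    return np.maximum(z, 0.0)

def max_pool(z, r):
    """Non-overlapping r x r max-pooling; incomplete windows are dropped."""
    n_i, n_j, d = z.shape
    z = z[: n_i - n_i % r, : n_j - n_j % r]
    return z.reshape(n_i // r, r, n_j // r, r, d).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))         # toy 8x8 image, 3 channels
filters = 0.1 * rng.standard_normal((3, 3, 3, 4))
feature_map = conv_layer(image, filters)       # shape (6, 6, 4)
pooled = max_pool(feature_map, r=2)            # shape (3, 3, 4)
```

Pooling with `r=2` reduces each spatial dimension by half, which is the dimensionality reduction mentioned in the text.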

After a small number of convolutional layers, the final 
feature map from the last convolutional layer is flattened to 
form a vector representation h of the input image. This vector 
h is further fed through a small number of fully-connected 
nonlinear layers until the output. 

Recently, CNNs have been found to be excellent at
the task of large-scale object recognition. For instance, the
annual ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) has a classification track where more than a million
annotated images with 1,000 classes are provided as a
training set. In this challenge, the CNN-based entries have
been dominant since 2012 [11], [12], [13], [14].

D. Transfer Learning with Deep Convolutional Network 

Once a deep CNN is trained on a large training set such as
the one provided as a part of the ILSVRC challenge, we can
use any intermediate representation, such as the feature map
from any convolutional layer or the vector representation from
any subsequent fully-connected layers, of the whole network
for tasks other than the original classification.

It has been observed that the use of these intermediate
representations from the deep CNN as an image descriptor
significantly boosts subsequent tasks such as object localization,
object detection, fine-grained recognition, attribute detection
and image retrieval (see, e.g., [15], [16]). Furthermore, more
non-trivial tasks, such as image caption generation [17], [18],
[19], [20], [21], have been found to benefit from using the
image descriptors from a pre-trained deep CNN. In later sections,
we will discuss in more detail how image representations from
a pre-trained deep CNN can be used in these non-trivial tasks
such as image caption generation [22] and video description
generation [23].

III. Attention-based Multimedia Description 

Multimedia description generation is a general task in 
which a model generates a natural language description of a 
multimedia input such as speech, image and video as well as 
text in another language, if we take a more general view. This 
requires a model to capture the underlying, complex mapping 
between the spatio-temporal structures of the input and the 
complicated linguistic structures in the output. In this section, 
we describe a neural network based approach to this problem, 
based on the encoder-decoder framework with the recently 
proposed attention mechanism. 

A. Encoder-Decoder Network 

An encoder-decoder framework is a general framework 
based on neural networks that aims at handling the mapping 
between highly structured input and output. It was proposed
recently in [24], [3], [25] in the context of machine translation,
where the input and output are natural language sentences 
written in two different languages. 

As the name suggests, a neural network based on this 
encoder-decoder framework consists of an encoder and a 
decoder. The encoder $f_{\text{enc}}$ first reads the input data $x$ into
a continuous-space representation $c$:

$$c = f_{\text{enc}}(x). \tag{8}$$

The choice of $f_{\text{enc}}$ largely depends on the type of input.
When $x$ is a two-dimensional image, a convolutional neural
network (CNN) from Sec. II-D may be used. A recurrent
neural network (RNN) in Sec. II-A is a natural choice when
$x$ is a sentence.

The decoder then generates the output $y$ conditioned on
the continuous-space representation, or context $c$ of the input.
This is equivalent to computing the conditional probability
distribution of $y$ given $x$:

$$p(Y \mid x) = f_{\text{dec}}(c). \tag{9}$$

Again, the choice of $f_{\text{dec}}$ is made based on the type of the
output. For instance, if $y$ is an image or a pixel-wise image
segmentation, a conditional restricted Boltzmann machine
(CRBM) can be used [26]. When $y$ is a natural language
description of the input $x$, it is natural to use an RNN which
is able to model natural languages, as described in Sec. II-B.


Fig. 1. Graphical illustration of the simplest form encoder-decoder model
for machine translation from [?]. $x = (x_1, \ldots, x_T)$, $y = (y_1, \ldots, y_{T'})$ and
$c$ are respectively the input sentence, the output sentence and the
continuous-space representation of the input sentence.


This encoder-decoder framework has been successfully
used in [25], [3] for machine translation. In both works, an
RNN was used as an encoder to summarize a source sentence
(where the summary is the last hidden state $h_T$ in Eq. (1))
from which a conditional RNN-LM from Sec. II-B decoded
out the corresponding translation. See Fig. 1 for the graphical
illustration.

In [?], [?], the authors used a pre-trained CNN as an
encoder and a conditional RNN as a decoder to let the model
generate a natural language caption of images. Similarly, a
simpler feedforward log-bilinear language model [27] was
used as a decoder in [?]. The authors of [28] applied the
encoder-decoder framework to video description generation, 
where they used a pre-trained CNN to extract a feature vector 
from each frame of an input video and averaged those vectors. 

In all these recent applications of the encoder-decoder
framework, the continuous-space representation $c$ of the input
$x$ returned by an encoder, in Eq. (8), has been a
fixed-dimensional vector, regardless of the size of the input.² Furthermore,
the context vector was not structured by design, but
rather an arbitrary vector, which means that there is no guarantee
that the context vector preserves the spatial, temporal or
spatio-temporal structures of the input. Henceforth, we refer
to an encoder-decoder based model with a fixed-dimensional 
context vector as a simple encoder-decoder model. 


B. Incorporating an Attention Mechanism 

1) Motivation: A naive implementation of the encoder- 
decoder framework, as in the simple encoder-decoder model, 
requires the encoder to compress the input into a single vector 
of predefined dimensionality, regardless of the size of or the 
amount of information in the input. For instance, the recurrent
neural network (RNN) based encoder used in [3], [25] for
machine translation needs to be able to summarize a
variable-length source sentence into a single fixed-dimensional vector.
Even when the size of the input is fixed, as in the case of a 
fixed-resolution image, the amount of information contained in 
each image may vary significantly (consider a varying number 
of objects in each image). 

In [29], it was observed that the performance of the neural
machine translation system based on a simple encoder-decoder
model rapidly degraded as the length of the source sentence
grew. The authors of [29] hypothesized that it was due to
the limited capacity of the simple encoder-decoder's
fixed-dimensional context vector.

Furthermore, the interpretability of the simple encoder-decoder
is extremely low. As all the information required for
the decoder to generate the output is compressed in a context
vector without any presupposed structure, such structure is not
available to techniques designed to inspect the representations
captured by the model [12], [30], [31].

2) Attention Mechanism for Encoder-Decoder Models: With
the introduction of an attention mechanism in between the
encoder and decoder, we address these two issues, i.e., (1) 
limited capacity of a fixed-dimensional context vector and (2) 
lack of interpretability. 

The first step in introducing the attention mechanism to
the encoder-decoder framework is to let the encoder return
a structured representation of the input. We achieve this by
allowing the continuous-space representation to be a set of
fixed-size vectors, to which we refer as a context set, i.e.,

$$c = \{c_1, c_2, \ldots, c_M\}.$$

See Eq. (8). Each vector in the context set is localized to
a certain spatial, temporal or spatio-temporal component of
the input. For instance, in the case of an image input, each
context vector $c_i$ will summarize a certain spatial location
of the image (see Sec. IV-B), and with machine translation,
each context vector will summarize a phrase centered around
a specific word in a source sentence (see Sec. IV-A). In all
cases, the number of vectors $M$ in the context set $c$ may vary
across input examples.

² Note that in the case of machine translation and video description
generation, the size of the input varies.

The choice of the encoder and of the kind of context set it 
will return is governed by the application and the type of the 
input considered. In this paper, we assume that the decoder
is a conditional RNN-LM from Sec. II-B, i.e., the goal is to
describe the input in a natural language sentence.

The attention mechanism controls the input actually seen
by the decoder and requires another neural network, to which
we refer as the attention model. The main job of the attention
model is to score each context vector $c_i$ with respect to the
current hidden state $z_{t-1}$ of the decoder:³


$$e_i^t = f_{\text{ATT}}\left(z_{t-1}, c_i, \{\alpha_j^{t-1}\}_{j=1}^{M}\right), \tag{10}$$

where $\alpha_j^{t-1}$ represents the attention weights computed at the
previous time step, from the scores $e_j^{t-1}$, through a softmax
that makes them sum to 1:

$$\alpha_i^t = \frac{\exp(e_i^t)}{\sum_{j=1}^{M} \exp(e_j^t)}. \tag{11}$$

This type of scoring can be viewed as assigning a probability
of being attended by the decoder to each context, hence the
name of the attention model.

Once the attention weights are computed, we use them to
compute the new context vector $c^t$:

$$c^t = \varphi\left(\{c_i\}_{i=1}^{M}, \{\alpha_i^t\}_{i=1}^{M}\right), \tag{12}$$

where $\varphi$ returns a vector summarizing the whole context set
$c$ according to the attention weights.

A usual choice for $\varphi$ is a simple weighted sum of the context
vectors such that

$$c^t = \varphi\left(\{c_i\}_{i=1}^{M}, \{\alpha_i^t\}_{i=1}^{M}\right) = \sum_{i=1}^{M} \alpha_i^t c_i. \tag{13}$$

On the other hand, we can also force the attention model to
make a hard decision on which context vector to consider by
sampling one of the context vectors following a categorical
(or multinoulli) distribution:

$$c^t = c_{r^t}, \text{ where } r^t \sim \text{Cat}\left(M, \{\alpha_i^t\}_{i=1}^{M}\right). \tag{14}$$
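
The scoring, normalization and context computation described above can be sketched in NumPy as follows. The specific form of the scoring function is an assumption made only for this example (a one-hidden-layer content-based scorer with hypothetical weights `W_z`, `W_c` and `v`); the text leaves $f_{\text{ATT}}$ general. The sketch shows both the soft weighted sum and the hard, sampled choice.

```python
import numpy as np

def softmax(e):
    """Turn scores into positive weights that sum to 1."""
    w = np.exp(e - np.max(e))
    return w / w.sum()

def attention_scores(z_prev, contexts, W_z, W_c, v):
    """Content-based scores e_i = v^T tanh(W_z z_prev + W_c c_i).

    This one-hidden-layer scorer is an assumption of the sketch.
    z_prev: (n_z,); contexts: (M, n_c); returns scores of shape (M,).
    """
    return np.tanh(z_prev @ W_z.T + contexts @ W_c.T) @ v

def soft_context(alpha, contexts):
    """Soft attention: weighted sum of the context vectors."""
    return alpha @ contexts

def hard_context(alpha, contexts, rng):
    """Hard attention: sample one context vector from Cat(alpha)."""
    r = int(rng.choice(len(alpha), p=alpha))
    return contexts[r], r

rng = np.random.default_rng(0)
M, n_c, n_z, n_a = 5, 3, 4, 6              # toy sizes
contexts = rng.standard_normal((M, n_c))   # context set {c_1, ..., c_M}
z_prev = rng.standard_normal(n_z)          # previous decoder state
W_z = rng.standard_normal((n_a, n_z))
W_c = rng.standard_normal((n_a, n_c))
v = rng.standard_normal(n_a)

alpha = softmax(attention_scores(z_prev, contexts, W_z, W_c, v))
c_soft = soft_context(alpha, contexts)
c_hard, r = hard_context(alpha, contexts, rng)
```

The soft context is differentiable with respect to the scorer's parameters, whereas the hard context depends on a discrete sample and therefore is not.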

With the newly computed context vector $c^t$, we can update
the hidden state of the decoder, which is a conditional RNN-LM
here, by

$$h_t = \phi_\theta(h_{t-1}, x_t, c^t). \tag{15}$$

This way of computing a context vector at each time step
$t$ of the decoder frees the encoder from compressing any
variable-length input into a single fixed-dimensional vector.
By spatially or temporally dividing the input,⁴ the encoder can
represent the input into a set of vectors of which each needs
to encode a fixed amount of information focused around a
particular region of the input. In other words, the introduction
of the attention mechanism bypasses the issue of limited
capacity of a fixed-dimensional context vector.

³ We use $z_t$ to denote the hidden state of the decoder to distinguish it from
the encoder's hidden state for which we used $h_t$ in Eq. (1).

⁴ Note that it is possible, or even desirable to use overlapping regions.

[Figure 2: attention-weight matrix. Row labels (output words): "L' accord sur la
zone economique europeenne a ete signe en aout 1992", followed by an
end-of-sentence symbol.]

Fig. 2. Visualization of the attention weights $\alpha_j^t$ of the attention-based neural
machine translation model [32]. Each row corresponds to the output symbol,
and each column the input symbol. The brighter the cell, the higher $\alpha_j^t$.

Furthermore, this attention mechanism allows us to directly
inspect the internal working of the whole encoder-decoder
model. The magnitude of the attention weight $\alpha_j^t$, which is
positive by construction in Eq. (11), highly correlates with
how predictive the spatial, temporal or spatio-temporal region
of the input, to which the $j$-th context vector corresponds, is
for the prediction associated with the $t$-th output variable $y_t$.
This inspection can be easily done by visualizing the attention matrix
$\left[\alpha_j^t\right]_{t,j} \in \mathbb{R}^{T' \times M}$, as in Fig. 2.

This attention-based approach with the weighted sum of
the context vectors (see Eq. (13)) was originally proposed in
[32] in the context of machine translation, however, with a
simplified (content-based) scoring function:

$$e_i^t = f_{\text{ATT}}(z_{t-1}, c_i). \tag{16}$$

See the missing $\{\alpha_j^{t-1}\}_{j=1}^{M}$ from Eq. (10). In [22], it was
further extended with the hard attention using Eq. (14). In [33]
this attention mechanism was extended by taking into
account the past values of the attention weights as the general
scoring function from Eq. (10), following an approach based
purely on those weights introduced by [34]. We will discuss
these three applications/approaches in more detail in the later
sections.

C. Learning 

As usual with many machine learning models, the attention-based
encoder-decoder model is also trained to maximize
the log-likelihood of a given training set with respect to the
parameters, where the log-likelihood is defined as

$$\mathcal{L}\left(D = \{(x^n, y^n)\}_{n=1}^{N}, \theta\right) = \frac{1}{N} \sum_{n=1}^{N} \log p(y^n \mid x^n, \theta), \tag{17}$$

where $\theta$ is a set of all the trainable parameters of the model.

1) Maximum Likelihood Learning: When the weighted sum
is used to compute the context vector, as in Eq. (13), the whole
attention-based encoder-decoder model becomes one large
differentiable function. This allows us to compute the gradient
of the log-likelihood in Eq. (17) using backpropagation [35].
With the computed gradient, we can use, for instance, the
stochastic gradient descent (SGD) algorithm to iteratively
update the parameters $\theta$ to maximize the log-likelihood.

2) Variational Learning for Hard Attention Model: When
the attention model makes a hard decision each time as
in Eq. (14), the derivatives through the stochastic decision
are zero, because those decisions are discrete. Hence, the
information about how to improve the way to take those
focus-of-attention decisions is not available from back-propagation,
while it is needed to train the attention mechanism. The
question of training neural networks with stochastic discrete-valued
hidden units has a long history, starting with Boltzmann
machines [36], with recent work studying how to deal with
such units in a system trained using back-propagated gradients
[37], [38], [39], [40]. Here we briefly describe the variational
learning approach from [?], [?].

With stochastic variables $r$ involved in the computation from
inputs to outputs, the log-likelihood in Eq. (17) is re-written
into

$$\mathcal{L}\left(D = \{(x^n, y^n)\}_{n=1}^{N}, \theta\right) = \frac{1}{N} \sum_{n=1}^{N} l(y^n, x^n, \theta),$$

where

$$l(y, x, \theta) = \log \sum_r p(y, r \mid x, \theta)$$

and $r = (r_1, r_2, \ldots, r_{T'})$. We derive a lowerbound of $l$ as

$$l(y, x) = \log \sum_r p(y \mid r, x)\, p(r \mid x) \geq \sum_r p(r \mid x) \log p(y \mid r, x). \tag{18}$$

Note that we omitted $\theta$ to make the equation less cluttered.
The gradient of $l$ with respect to $\theta$ is then

$$\nabla l(y, x) = \sum_r p(r \mid x) \left[ \nabla \log p(y \mid r, x) + \log p(y \mid r, x) \nabla \log p(r \mid x) \right], \tag{19}$$

which is often approximated by Monte Carlo sampling:

$$\nabla l(y, x) \approx \frac{1}{M} \sum_{m=1}^{M} \left[ \nabla \log p(y \mid r^m, x) + \log p(y \mid r^m, x) \nabla \log p(r^m \mid x) \right]. \tag{20}$$

As the variance of this estimator is high, a number of variance
reduction techniques, such as baselines and variance normalization,
are often used in practice [?], [?].

Once the gradient is estimated, any usual gradient-based
iterative optimization algorithm can be used to approximately
maximize the log-likelihood.
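
To make the Monte Carlo estimator concrete, the following toy sketch compares it with the exact gradient of the lower bound on a deliberately simplified problem. The assumptions are ours, not the paper's: a single categorical attention decision $r$ whose distribution is a softmax over free logits, and fixed values of $\log p(y \mid r, x)$ that do not depend on those logits, so the first term in the bracket vanishes and only the score-function term remains.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def exact_gradient(logits, log_lik):
    """Exact gradient of sum_r p(r) * log_lik[r] w.r.t. the logits of p(r)."""
    p = softmax(logits)
    return p * (log_lik - p @ log_lik)

def mc_gradient(logits, log_lik, n_samples, rng):
    """Monte Carlo estimate of the same gradient from sampled decisions."""
    p = softmax(logits)
    r = rng.choice(len(p), size=n_samples, p=p)  # r^m ~ p(r | x)
    grad_log_p = np.eye(len(p))[r] - p           # d log p(r^m) / d logits
    # Average of log p(y | r^m, x) * grad log p(r^m | x) over the samples.
    return (log_lik[r][:, None] * grad_log_p).mean(axis=0)

rng = np.random.default_rng(0)
logits = np.array([0.5, -1.0, 0.2, 1.5])         # toy attention logits
log_lik = np.log(np.array([0.7, 0.1, 0.4, 0.2])) # toy log p(y | r, x)
g_exact = exact_gradient(logits, log_lik)
g_mc = mc_gradient(logits, log_lik, n_samples=200_000, rng=rng)
```

With a small number of samples the estimate fluctuates considerably around the exact value, which is the high variance that motivates baselines and related variance reduction techniques.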

IV. Applications 

In this section, we introduce some of the recent work in 
which the attention-based encoder-decoder model was applied 
to various multimedia description generation tasks. 


TABLE I
The translation performances and the relative improvements
over the simple encoder-decoder model on an
English-to-French translation task, measured by BLEU [32],
[42]. *: an ensemble of multiple attention-based models, o: the
state-of-the-art phrase-based statistical machine translation
system [43].


A. Neural Machine Translation 


Machine translation is a task in which a sentence in one
language (source) is translated into a corresponding sentence
in another language (target). Neural machine translation aims
at solving it with a single neural network based model, jointly
trained end-to-end. The encoder-decoder framework described
in Sec. III-A was proposed for neural machine translation
recently in [24], [3], [25]. Based on these works, in [32], the
attention-based model was proposed to make neural machine
translation systems more robust to long sentences. Here, we
briefly describe the model from [32].

1) Model Description: The attention-based neural machine
translation in [32] uses a bidirectional recurrent neural network
(BiRNN) as an encoder. The forward network reads the input
sentence $x = (x_1, \ldots, x_T)$ from the first word to the last,
resulting in a sequence of state vectors

$$\left\{\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_T\right\}.$$

The backward network, on the other hand, reads the input
sentence in the reverse order, resulting in

$$\left\{\overleftarrow{h}_T, \overleftarrow{h}_{T-1}, \ldots, \overleftarrow{h}_1\right\}.$$

These vectors are concatenated per step to form a context set
(see Sec. III-B2) such that $c_t = \left[\overrightarrow{h}_t; \overleftarrow{h}_t\right]$.
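
The construction of the context set from a bidirectional encoder can be sketched as follows. For brevity this illustration assumes a plain tanh recurrence instead of the gated units used in the paper, and the weight names and sizes are hypothetical.

```python
import numpy as np

def rnn_states(xs, U, W):
    """Run a tanh RNN over xs and return the hidden state at every step."""
    h = np.zeros(U.shape[0])
    states = []
    for x in xs:
        h = np.tanh(U @ h + W @ x)
        states.append(h)
    return np.stack(states)

def birnn_context_set(xs, fwd, bwd):
    """Concatenate forward and backward states for each input position.

    fwd and bwd are (U, W) weight pairs of the two directions.
    Returns an array of shape (T, 2 * n_hidden): one context vector per word.
    """
    h_fwd = rnn_states(xs, *fwd)
    h_bwd = rnn_states(xs[::-1], *bwd)[::-1]  # re-align to the input order
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
T, n_input, n_hidden = 4, 3, 5               # toy sizes
xs = rng.standard_normal((T, n_input))       # toy source sentence
fwd = (0.5 * rng.standard_normal((n_hidden, n_hidden)),
       rng.standard_normal((n_hidden, n_input)))
bwd = (0.5 * rng.standard_normal((n_hidden, n_hidden)),
       rng.standard_normal((n_hidden, n_input)))
contexts = birnn_context_set(xs, fwd, bwd)   # shape (4, 10)
```

Each row of `contexts` combines a summary of the words up to that position (forward half) with a summary of the words from that position to the end (backward half), so the context vector of a word reflects the whole sentence.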


Fig. 3. Illustration of a single step of decoding in attention-based
neural machine translation [32].


The use of the BiRNN is crucial if the content-based
attention mechanism is used. The content-based attention
mechanism (see Eq. (16)) relies solely on a so-called
content-based scoring, and without the context information
from the whole sentence, words that appear multiple times
in a source sentence cannot be distinguished by the attention
model.

The decoder is a conditional RNN-LM that models the
target language given the context set from above. See Fig. 3 for
the graphical illustration of the attention-based neural machine
translation model.



Model                            BLEU     Rel. Improvement
Simple Enc-Dec                   17.82    -
Attention-based Enc-Dec          28.45    +59.7%
Attention-based Enc-Dec (LV)     34.11    +90.7%
Attention-based Enc-Dec (LV)*    37.19    +106.0%
State-of-the-art SMT (o)         37.03    -


2) Experimental Result: Given a fixed model size, the
attention-based model proposed in [32] was able to achieve
a relative improvement of more than 50% in the case of the
English-to-French translation task, as shown in Table I. When
the very same model was extended with a very large target
vocabulary [42], the relative improvement over the baseline
without the attention mechanism was 90%. Additionally, the
very same model was recently tested on a number of European
language pairs at the WMT'15 Translation Task.⁵ See Table II
for the results.

The authors of [?] recently proposed a method for incorporating
a monolingual language model into the attention-based
neural machine translation system. With this method, the
attention-based model was shown to outperform the existing
statistical machine translation systems on Chinese-to-English
(restricted domains) and Turkish-to-English translation tasks
as well as other European languages they tested.


B. Image Caption Generation 

Image caption generation is a task in which a model looks 
at an input image and generates a corresponding natural 
language description. The encoder-decoder framework fits 
well with this task. The encoder will extract the continuous-space
representation, or the context, of an input image, for
instance, with a deep convolutional network (see Sec. II-C),
and from this representation the conditional RNN-LM based
decoder generates a natural language description of the image.
Very recently (Dec 2014), a number of research groups
independently proposed to use the simple encoder-decoder model
to solve the image caption generation [18], [17], [19], [20].


⁵ http://www.statmt.org/wmt15/


TABLE II
The performance of the attention-based neural machine
translation models with the very large target vocabulary in
the WMT'15 Translation Track [42]. We show the results on
two representative language pairs. For the complete result,
see http://matrix.statmt.org/

Language Pair    Model           BLEU    Note
En->De           NMT             24.8
                 Best Non-NMT    24.0    Syntactic SMT (Edinburgh)
En->Cz           NMT             18.3
                 Best Non-NMT    18.2    Phrase SMT (JHU)

Instead, here we describe a more recently proposed approach
based on the attention-based encoder-decoder framework in
[22].



Fig. 4. Graphical illustration of the attention-based encoder-decoder model 
for image caption generation. 


1) Model Description: The usual encoder-decoder based 
image caption generation models use the activation of the 
last fully-connected hidden layer as the continuous-space 
representation, or the context vector, of the input image (see 
Sec. II-D ) The authors of (22l however proposed to use the 
activation from the last convolutional layer of the pre-trained 
convolutional network, as in the bottom half of Fig. [4] 

Unlike the fully-connected layer, in this case, the context set
consists of multiple vectors that correspond to different spatial
regions of the input image, to which the attention mechanism
can be applied. Furthermore, due to convolution and pooling,
the spatial locations in pixel space represented by each context
vector overlap substantially with those represented by
the neighbouring context vectors, which helps the attention
mechanism distinguish similar objects in an image using their
context information with respect to the whole image, or the
neighbouring pixels.
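
As a rough illustration of how a convolutional feature map can be turned into a context set, the sketch below flattens a map of shape height x width x channels into height * width context vectors, one per spatial region. The function name `feature_map_to_context_set` and the toy 2 x 2 x 3 map are assumptions made for this example and are not taken from the cited work.

```python
def feature_map_to_context_set(feature_map):
    """Flatten an H x W x D feature map (nested lists) into H*W context vectors."""
    context_set = []
    for row in feature_map:              # iterate over the H spatial rows
        for cell in row:                 # iterate over the W spatial columns
            context_set.append(list(cell))   # one D-dimensional vector per region
    return context_set


# Toy 2 x 2 feature map with 3 channels (values are arbitrary).
toy_map = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]],
]
context_set = feature_map_to_context_set(toy_map)
print(len(context_set), len(context_set[0]))
```

Each entry of `context_set` then plays the role of one context vector that the attention mechanism can score separately.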

Similarly to the attention-based neural machine translation
in Sec. IV-A, the decoder is implemented as a conditional
RNN-LM. In [22], the content-based attention mechanism
with either the weighted sum or the hard decision was tested
by training a model with the maximum likelihood estimator
from Sec. III-C1 and the variational learning from Sec. III-C2,
respectively. The authors of [22] reported similar performance
with these two approaches on a number of benchmark datasets.
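
The difference between the weighted sum and the hard decision can be sketched as follows. This is a minimal illustration, assuming that the hard decision amounts to sampling a single context vector according to the attention weights. The relevance scores, the helper names and the toy context set are hypothetical.

```python
import math
import random


def softmax(scores):
    """Normalize relevance scores into attention weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_context(context_set, weights):
    """Weighted sum: every context vector contributes in proportion to its weight."""
    dim = len(context_set[0])
    return [sum(w * c[d] for w, c in zip(weights, context_set)) for d in range(dim)]


def hard_context(context_set, weights, rng):
    """Hard decision (assumed here to be sampling): pick a single context vector."""
    index = rng.choices(range(len(context_set)), weights=weights, k=1)[0]
    return context_set[index]


context_set = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = softmax([2.0, 0.5, 0.5])      # hypothetical relevance scores
print(soft_context(context_set, weights))
print(hard_context(context_set, weights, random.Random(0)))
```

The weighted sum is differentiable in the weights, while the sampled choice is not, which is consistent with the two variants being trained with different procedures.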

2) Experimental Result: In [22], the attention-based image
caption generator was evaluated on three datasets: Flickr
8K [47], Flickr 30K [48] and MS CoCo [49]. In addition to
the self-evaluation, an ensemble of multiple attention-based
models was submitted to the Microsoft COCO Image Captioning
Challenge and evaluated with multiple automatic evaluation
metrics as well as by human evaluators.


https://www.codalab.org/competitions/3221
BLEU [50], METEOR [51], ROUGE-L [52] and CIDEr [53]


TABLE III

The performances of the image caption generation models in
the Microsoft COCO Image Captioning Challenge. (*) [20], (•)
HU, (°) [43], (o) [46] and (*) [22]. The rows are sorted
according to M1.

                   Human            Automatic
Model              M1      M2       BLEU    CIDEr
Human              0.638   0.675    0.471   0.91
Google*            0.273   0.317    0.587   0.946
MSR*               0.268   0.322    0.567   0.925
Attention-based*   0.262   0.272    0.523   0.878
Captivator°        0.250   0.301    0.601   0.937
Berkeley LRCN°     0.246   0.268    0.534   0.891


In this Challenge, the attention-based approach ranked third
based on the percentage of captions that are evaluated as better
than or equal to the human caption (M1) and the percentage of
captions that pass the Turing Test (M2). Interestingly, the same model
was ranked eighth according to the most recently proposed
metric of CIDEr and ninth according to the most widely used
metric of BLEU. This means that the model has better relative
performance in terms of human evaluation than in terms of the
automatic metrics, which only look at matching subsequences
of words, not directly at the meaning of the generated sentence.
The performance of the top-ranked systems, including the
attention-based model from [22], is listed in Table III.
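
To make the point about matching subsequences of words concrete, the sketch below computes a clipped n-gram precision, which is one building block of BLEU rather than the full metric. The example sentences and the function names are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token subsequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (clipped counts)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


reference = "a dog is standing on a hardwood floor".split()
paraphrase = "a puppy stands on the wooden floor".split()   # similar meaning
near_copy = "a dog is standing on a floor".split()          # similar wording

print(clipped_precision(paraphrase, reference, 2))
print(clipped_precision(near_copy, reference, 2))
```

In this toy case the paraphrase shares no bigram with the reference even though a human reader would likely accept it, whereas the near copy matches most bigrams. This is one way in which human judgments and n-gram based metrics can diverge.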

The attention-based model was further found to be highly
interpretable, especially compared to the simple encoder-decoder
models. See Fig. 5 for some examples.

C. Video Description Generation 

Soon after the neural machine translation based on the
simple encoder-decoder framework was proposed in [25],
it was further applied to video description generation,
which amounts to translating a (short) video clip to its natural
language description [28]. The authors of [28] used a pre-trained
convolutional network to extract a
feature vector from each frame of the video clip and averaged all
the frame-specific vectors to obtain a single fixed-dimensional
context vector of the whole video. A conditional RNN-LM
from Sec. II-B was used to generate a description based on
this context vector.

Since any video clip clearly has both temporal and spatial
structures, it is possible to exploit them by using the attention
mechanism described throughout this paper. In [23], the authors
proposed an approach based on the attention mechanism
to exploit the global and local temporal structures of the video
clips. Here we briefly describe their approach.

1) Model Description: In [23], two different types of
encoders are tested. The first one is a simple frame-wise
application of the pre-trained convolutional network. However,
the authors did not pool those per-frame context vectors as was
done in [28], but simply formed a context set consisting of all the
per-frame feature vectors. The attention mechanism works to
select one of those per-frame vectors for each output symbol
being decoded. In this way, the authors claimed that the overall
model captures the global temporal structure (the structure
across many frames, potentially across the whole video clip).
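
The contrast between pooling the per-frame vectors and keeping them as a context set can be sketched as follows. The frame features and attention weights are toy values chosen for illustration, and the soft weighting used in `attend` is an assumption about how the selection is realized.

```python
def mean_pool(frame_vectors):
    """Average per-frame feature vectors into one fixed-dimensional context vector."""
    n = len(frame_vectors)
    dim = len(frame_vectors[0])
    return [sum(v[d] for v in frame_vectors) / n for d in range(dim)]


def attend(frame_vectors, weights):
    """Combine per-frame vectors with attention weights (assumed to sum to one)."""
    dim = len(frame_vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, frame_vectors)) for d in range(dim)]


frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]]    # three toy frame features
pooled = mean_pool(frames)                        # same context for every word
early = attend(frames, [0.8, 0.1, 0.1])           # a word tied to the first frame
late = attend(frames, [0.1, 0.1, 0.8])            # a word tied to the last frame
print(pooled, early, late)
```

Mean pooling is the special case of uniform weights, so the context set with attention can only add flexibility: each generated word may draw on a different part of the clip.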

http://mscoco.org/dataset/#leaderboard-cap

[Figure 5 panels, with the generated captions:
"A woman is throwing a frisbee in a park."
"A dog is standing on a hardwood floor."
"A stop sign is on a road with a mountain in the background."
"A little girl sitting on a bed with a teddy bear."
"A group of people sitting on a boat in the water."
"A giraffe standing in a forest with trees in the background."]

Fig. 5. Examples of the attention-based model attending to the correct object (white indicates the attended regions, underlines indicate the corresponding
word) [22].



TABLE IV

The performance of the video description generation models
on Youtube2Text and Montreal DVS. (*) Higher the better.
(o) Lower the better.

                    Youtube2Text                 Montreal DVS
Model               METEOR(*)   Perplexity(o)    METEOR   Perplexity
Enc-Dec             0.2868      33.09            0.044    88.28
+ 3-D CNN           0.2832      33.42            0.051    84.41
+ Per-frame CNN     0.2900      27.89            0.040    66.63
+ Both              0.2960      27.55            0.057    65.44


Fig. 6. The 3-D convolutional network for motion from [23].


The other type of encoder in [23] is a so-called 3-D
convolutional network, shown in Fig. 6. Unlike the usual
convolutional network, which often works only spatially over a
two-dimensional image, the 3-D convolutional network applies
its (local) filters across the spatial dimensions as well as the
temporal dimension. Furthermore, those filters work not on
pixels but on local motion statistics, enabling the model to
concentrate on motion rather than appearance. Similarly to
the pre-trained convolutional network used for images, the model
was trained on larger video datasets to recognize an action from
each video clip, and
the activation vectors from the last convolutional layer were
used as context. The authors of [23] suggest that this encoder
extracts more local temporal structures, complementing the
global structures extracted from the frame-wise application of
a 2-D convolutional network.
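
The idea of a filter that spans time as well as space can be illustrated with a naive 3-D convolution (strictly, a cross-correlation without padding). This is only a sketch: the toy clip holds raw values rather than the local motion statistics mentioned above, and the filter is a hand-picked temporal difference rather than a learned one.

```python
def conv3d_valid(volume, kernel):
    """Slide a kT x kH x kW filter over a T x H x W volume (no padding, stride 1)."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kT, kH, kW = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kT + 1):
        plane = []
        for y in range(H - kH + 1):
            row = []
            for x in range(W - kW + 1):
                acc = 0.0
                for dt in range(kT):
                    for dy in range(kH):
                        for dx in range(kW):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy clip: 3 frames of 2 x 2 values; the filter takes a temporal difference.
clip = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
temporal_diff = [[[-1.0]], [[1.0]]]     # filter of shape 2 x 1 x 1
response = conv3d_valid(clip, temporal_diff)
print(response)
```

Because the filter extends over two consecutive frames, its response depends on how the clip changes over time, which a purely spatial 2-D filter applied frame by frame cannot capture.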

The same type of decoder as in the previous applications, a
conditional RNN-LM, was used with the content-based attention
mechanism.

2) Experimental Result: In [23], this approach to video
description generation was tested on two datasets: (1)
Youtube2Text [54] and (2) Montreal DVS [55]. The authors showed
that it is beneficial to have both types of encoders together
in their attention-based encoder-decoder model, and that
the attention-based model outperforms the simple encoder-decoder
model. See Table IV for the summary of the evaluation.


Similarly to all the other previous applications of the
attention-based model, the attention mechanism applied to the
task of video description also provides a straightforward way
to inspect the inner workings of the model. See Fig. 7 for
some examples.



+Local+Global: A man and a woman are talking on the 
Ref: A man and a woman ride a motorcycle 



+Local+Global: Someone is frying a fish in a 


Ref: A woman is frying food 

Fig. 7. Two sample videos and their corresponding generated and ground-truth
descriptions from Youtube2Text. The bar plot under each frame corresponds
to the attention weight $\alpha_i^t$ for the frame when the
corresponding word (color-coded) was generated. Reprinted from [23].
D. End-to-End Neural Speech Recognition

Speech recognition is a task in which a given speech
waveform is translated into a corresponding natural language
transcription. Deep neural networks have become a standard
for the acoustic part of speech recognition systems [56]. Once
the input speech (often in the form of spectral filter responses)
is processed with the deep neural network based acoustic
model, another model, almost always a hidden Markov model
(HMM), is used to correctly map the much longer sequence of
speech into a shorter sequence of phonemes/characters/words.
Only recently, in [57], [8], [58], [59], fully neural network
based speech recognition models were proposed.

Here, we describe the recently proposed attention-based
fully neural speech recognizer from [33]. For a more detailed
comparison between the attention-based fully neural speech
recognizer and other neural speech recognizers, e.g., from [58], we
refer the reader to [33].

1) Model Description-Hybrid Attention Mechanism: The
basic architecture of the attention-based model for speech
recognition in [33] is similar to the other attention-based
models described earlier, especially the attention-based neural
machine translation model in Sec. IV-A. The encoder is a
stacked bidirectional recurrent neural network (BiRNN) [60]
which reads the input sequence of speech frames, where each
frame is a 123-dimensional vector consisting of 40 Mel-scale
filter-bank responses, the energy, and their first- and second-order
temporal differences. The context set of the concatenated
hidden states from the top-level BiRNN is used by the
decoder based on the conditional RNN-LM to generate the
corresponding transcription, which in the case of [33] consists
of a sequence of phonemes.
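
The 123 dimensions follow from (40 + 1) static values per frame, each accompanied by a first- and a second-order temporal difference. The sketch below reproduces that count on toy data. The centered-difference formula and the handling of edge frames are assumptions, since the exact procedure is not specified here.

```python
def temporal_difference(frames):
    """Centered difference along time, with the edge frames repeated."""
    n = len(frames)
    deltas = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        deltas.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return deltas


def add_differences(static_frames):
    """Append first- and second-order temporal differences to each static frame."""
    first = temporal_difference(static_frames)
    second = temporal_difference(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static_frames, first, second)]


# 5 toy frames, each with 40 filter-bank values plus 1 energy value.
static = [[float(t)] * 41 for t in range(5)]
features = add_differences(static)
print(len(features), len(features[0]))
```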

The authors of [33], however, noticed the peculiarity of
speech recognition compared to, for instance, machine translation.
First, the lengths of the input and output differ significantly:
thousands of input speech frames against a dozen
words. Second, the alignment between the symbols in the
input and output sequences is monotonic, whereas this is often
not true in the case of translation.

These issues, especially the first one, make it difficult
for the content-based attention mechanism
to work well. The authors of [33] investigated
these issues more carefully and proposed that an
attention mechanism with location awareness is particularly
appropriate. The location awareness in this case
means that the attention mechanism directly takes into account
the previous attention weights to compute the next ones.

The proposed location-aware attention mechanism scores
each context vector by

$$e_i^t = f_{\text{ATT}}\left(z_{t-1}, c_i, f_{\text{LOC}}^i\left(\{\alpha_j^{t-1}\}_{j}\right)\right),$$

where $f_{\text{LOC}}^i$ is a function that extracts information from the
previous attention weights $\{\alpha_j^{t-1}\}$ for the $i$-th context vector.
In other words, the location-aware attention mechanism takes
into account both the content $c_i$ and the previous attention
weights $\{\alpha_j^{t-1}\}_{j}$.


In [33], $f_{\text{LOC}}$ was implemented as

$$f_{\text{LOC}}^i\left(\{\alpha_j^{t-1}\}\right) = \sum_{k=i-K/2}^{i+K/2} v_k \alpha_k^{t-1},$$

where $K$ is the size of the window, and $v_k$ is a learned
vector.
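
One way to read the windowed function is as a weighted sum of the previous attention weights around each position. The sketch below assumes one scalar weight per window offset (the source describes a learned vector), an odd window size, and zero contribution from positions outside the input. The numeric values are hypothetical.

```python
def location_feature(prev_weights, i, window_weights):
    """Weighted sum of previous attention weights in a window centred on position i."""
    half = len(window_weights) // 2          # assumes an odd window size K
    total = 0.0
    for offset, v in zip(range(-half, half + 1), window_weights):
        k = i + offset
        if 0 <= k < len(prev_weights):       # positions outside the input contribute 0
            total += v * prev_weights[k]
    return total


prev_weights = [0.0, 0.1, 0.8, 0.1, 0.0, 0.0]   # attention at the previous step
window_weights = [1.0, 0.5, 0.0]                 # hypothetical learned values, K = 3
features = [location_feature(prev_weights, i, window_weights)
            for i in range(len(prev_weights))]
print(features)
```

With these particular window values the feature is largest just after the previously attended position, which is the kind of signal that could help a model keep a monotonic alignment.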

Furthermore, the authors of [33] proposed additional modifications
to the attention mechanism, such as sharpening,
windowing and smoothing. For more
details of each of these, we refer the reader to [33].

2) Experimental Result: In [33], this attention-based speech
recognizer was evaluated on the widely-used TIMIT corpus [61],
closely following the procedure from [62]. As can
be seen from Table V, the attention-based speech recognizer
with the location-aware attention mechanism, which recognizes a
sequence of phonemes given a speech segment, can perform
better than the conventional fully neural speech recognizer.
Also, the location-aware attention mechanism helps the model
achieve better generalization error.

TABLE V

Phoneme error rates (PER). The bold-faced PER corresponds
to the best error rate achieved with a fully neural network
based model. From [33].

Model                                         Dev     Test
Attention-based Model                         15.9%   18.7%
Attention-based Model + Location-Awareness    15.8%   17.6%
RNN Transducer [62]                           N/A     17.7%
Time/Frequency Convolutional Net+HMM [63]     13.9%   16.7%


Similarly to the previous applications, it is again possible
to inspect the model’s behaviour by visualizing the attention
weights. An example is shown in Fig. 8, where we can clearly
see how the model attends to a roughly correct window of
speech each time it generates a phoneme.

E. Beyond Multimedia Content Description 

We briefly present three recent works which applied the
described attention-based mechanism to tasks other than
multimedia content description.

1) Parsing-Grammar as a Foreign Language: Parsing a
sentence into a parse tree can be considered as a variant of
machine translation, where the target is not a sentence but its
parse tree. In [64], the authors evaluate the simple encoder-decoder
model and the attention-based model on generating
the linearized parse tree associated with a natural language
sentence. Their experiments revealed that the attention-based
parser can match the existing state-of-the-art parsers which are
often highly domain-specific.
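
Linearizing a parse tree means writing it out as a flat sequence of tokens that a decoder can emit one at a time. The sketch below uses a depth-first traversal with labelled opening and closing brackets. This bracket format, the nested-tuple representation and the example parse are assumptions for illustration and may differ from the format used in the cited work.

```python
def linearize(tree):
    """Turn a nested (label, children...) tuple into a bracketed token sequence."""
    if isinstance(tree, str):               # a leaf is a word
        return [tree]
    label, *children = tree
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)
    return tokens


# Hypothetical parse of "John has a dog".
tree = ("S", ("NP", "John"), ("VP", "has", ("NP", "a", "dog")))
print(" ".join(linearize(tree)))
```

Once the tree is a token sequence, the parsing task has the same input and output types as translation, so the same encoder-decoder machinery applies.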

FDHC0SX209: Michael colored the bedroom wall with crayons.
Fig. 8. Attention weights by the attention-based model with location-aware attention mechanism. The vertical bars indicate ground-truth phone location.
For more details, see [33].

2) Discrete Optimization-Pointer Network: In [65], the attention
mechanism was used to (approximately) solve discrete
optimization problems. Unlike the usual use of the described
attention mechanism where the decoder generates a sequence
of output symbols, in their application to discrete optimization,
the decoder predicts which one of the source symbols/nodes
should be chosen at each time step. The authors achieve this
by considering $\alpha_i^t$ as the probability of choosing the $i$-th input
symbol as the selected one, at each time step $t$.

For instance, in the case of the travelling salesperson problem
(TSP), the model needs to generate a sequence of cities/nodes
that covers the whole set of input cities so that the sequence is
the shortest possible route in the input map (a graph of the
cities) that visits every single city/node. First, the encoder reads
the graph of a TSP instance and returns a set of context vectors,
each of which corresponds to a city in the input graph. The
decoder then returns a sequence of probabilities over the input
cities, or equivalently the context vectors, which are computed
by the attention mechanism. The model is trained to generate
a sequence that covers all the cities by correctly attending to each
city using the attention mechanism.
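
The decoding loop described above can be sketched as follows: at every step the attention weights form a distribution over the input cities, and one city is selected. In this illustration the learned attention scores are replaced by a hand-written stand-in (negative distance from the current city), visited cities are masked out, and selection is greedy. These are simplifying assumptions, not the trained model of the cited work.

```python
import math


def pointer_distribution(scores, visited):
    """Softmax over input positions, with already chosen positions masked out."""
    exps = [0.0 if i in visited else math.exp(s) for i, s in enumerate(scores)]
    total = sum(exps)
    return [e / total for e in exps]


def decode_tour(cities, start=0):
    """Greedy decoding: at each step point at the most probable unvisited city."""
    tour, visited = [start], {start}
    while len(tour) < len(cities):
        cx, cy = cities[tour[-1]]
        # Stand-in for learned attention scores: closer cities score higher.
        scores = [-math.hypot(x - cx, y - cy) for x, y in cities]
        probs = pointer_distribution(scores, visited)
        choice = max(range(len(cities)), key=lambda i: probs[i])
        tour.append(choice)
        visited.add(choice)
    return tour


cities = [(0.0, 0.0), (5.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(decode_tour(cities))
```

The output is a sequence of indices into the input rather than symbols from a fixed vocabulary, which is what lets the same model handle inputs of different sizes.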

As was shown already in [65], this approach can be applied
to any discrete optimization problem whose solution is
expressed as a subset of the input symbols, such as sorting.

3) Question Answering-Weakly Supervised Memory Network:
The authors of [66] applied the attention-based model
to a question-answering (QA) task. Each instance of this QA 
task consists of a set of facts and a question, where each fact 
and the question are both natural language sentences. Each fact 
is encoded into a continuous-space representation, forming a 
context set of fact vectors. The attention mechanism is applied 
to the context set given the continuous-space representation of 
the question so that the model can focus on the relevant facts 
needed to answer the question. 

V. Related Work: Attention-based Neural Networks

The most closely related model is a neural network
with a location-based attention mechanism, as opposed to the
content-based attention mechanism described in this paper.
The content-based attention mechanism computes the relevance
of each spatially, temporally or spatio-temporally localized
region of the input, while the location-based one directly
returns the region to which the model needs to attend, often in
the form of a coordinate such as the $(x, y)$-coordinate of an
input image or an offset from the current coordinate.
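
To illustrate what attending by coordinate can look like, the sketch below crops a small patch (a "glimpse") around a given $(x, y)$ position. In a location-based model the coordinate would be produced by the network itself. Here it is supplied by hand, and the function name, patch size and zero padding are assumptions for this example.

```python
def extract_glimpse(image, center, size):
    """Crop a size x size patch around (x, y); pixels outside the image read as 0."""
    cx, cy = center
    half = size // 2                         # assumes an odd patch size
    patch = []
    for y in range(cy - half, cy + half + 1):
        row = []
        for x in range(cx - half, cx + half + 1):
            inside = 0 <= y < len(image) and 0 <= x < len(image[0])
            row.append(image[y][x] if inside else 0)
        patch.append(row)
    return patch


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(extract_glimpse(image, (0, 0), 3))     # patch centred on the top-left pixel
print(extract_glimpse(image, (1, 1), 3))     # patch centred on the middle pixel
```

In contrast to the content-based mechanism, no relevance score is computed for every region here: the model only has to output where to look.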

In [34], the location-based attention mechanism was successfully
used to model and generate handwritten text. In
USD, Ell, a neural network is designed to use the location- 
based attention mechanism to recognize objects in an image. 
Furthermore, a generative model of images was proposed in
[68], which iteratively reads and writes portions of the whole
image using the location-based attention mechanism. Earlier
works on utilizing the attention mechanism, both content-based
and location-based, for object recognition/tracking can
be found in [69], [70], [71].

The attention-based mechanism described in this paper, or its
variant, may be applied to something other than multimedia
input. For instance, in [72], a neural Turing machine was
proposed, which implements a memory controller using both
the content-based and location-based attention mechanisms.
Similarly, the authors of [73] used the content-based attention
mechanism with hard decision to find
relevant memory contents, which was further extended to the
weakly supervised memory network in [66], described in Sec. IV-E3.


VI. Looking Ahead... 


In this paper, we described the recently proposed attention-based
encoder-decoder architecture for describing multimedia
content. We started by providing background material on
recurrent neural networks (RNN) and convolutional networks
(CNN), which form the building blocks of the encoder-decoder
architecture. We emphasized the specific variants of those
networks that are often used in the encoder-decoder model:
a conditional language model based on RNNs (a conditional
RNN-LM) and a pre-trained CNN for transfer learning. Then,
we introduced the simple encoder-decoder model followed by
the attention mechanism, which together form the central topic
of this paper, the attention-based encoder-decoder model.

We presented four recent applications of the attention-based
encoder-decoder models: machine translation (Sec. IV-A),
image caption generation (Sec. IV-B), video description generation
(Sec. IV-C) and speech recognition (Sec. IV-D). We gave
a concise description of the attention-based model for each
of these applications together with the model’s performance
on benchmark datasets. Furthermore, each description was
accompanied by a figure visualizing the behaviour of the
attention mechanism.

In the examples discussed above, the attention mechanism
was primarily considered as a means of building a model that
can describe the input multimedia content in natural language,
meaning the ultimate goal of the attention mechanism was
to aid the encoder-decoder model for multimedia content
description. However, this should not be taken as the only
possible application of the attention mechanism. Indeed, as
recent work such as the pointer networks [65] suggests, future
applications of attention mechanisms could span the range of
AI-related tasks.

Besides the superior performance it delivers, an attention mechanism
can be used to extract the underlying mapping between
two entirely different modalities without explicit supervision
of the mapping. From the attention visualizations in the figures
above, it is clear that the
attention-based models were able to infer, in an unsupervised
way, alignments between different modalities (multimedia
and its text description) that agree well with our intuition. This
suggests that this type of attention-based model can be used
solely to extract these underlying, often complex, mappings
from a pair of modalities, where there is not much prior/domain
knowledge. As an example, attention-based models
can be used in neuroscience to temporally and spatially map 
between the neuronal activities and a sequence of stimuli [SI- 